1 Introduction

Here we present a summary of the processing steps applied to the WASH dataset.

  • The aim is to investigate the contribution of demographic, social and economic factors to improved water, sanitation and hygiene (WASH) among the urban poor.



1.1 Activities

  • Three WASH variables were created as per the WHO definition (Damazo). See the codebook for variable labels.

    • cat_watersource

    • cat_toilettype

    • cat_garbagedisposal

  • For every variable, cases with NIU or Missing: Impute were recoded to NA.

  • Cases which had NA in Gender and Age were completely dropped.

  • Smaller groups were combined into an "Others" category.

  • A composite variable was created from the three WASH variables using logistic PCA.

    • PC scores were used to create categories (quantiles).

  • Total household expenditure was centered.

  • For every case, we summed the number of WASH indicators they had access to (max = 3) and calculated the proportion (we are not sure what to call this rate). Is it possible to model the total as a Poisson process?

  • Visualization plots were created for the individual WASH variables, but the initial modelling is on the composite WASH variable.

  • We also present the results from a Generalized Linear Mixed-effects Model (GLMM) fitted using the lme4 package (glmer).
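The recoding, case-dropping, centering and indicator-count steps above can be sketched as follows. This is a minimal illustration in Python using only the standard library (the report's own analysis uses R). The toy records, the 0/1 coding of the three `cat_*` variables, the column name `expend_total_USD_per`, the rule "drop a case if either gender or age is NA", and the rule "leave the total as NA when any indicator is NA" are all assumptions made for the example, not the actual pipeline.

```python
from statistics import mean

MISSING_CODES = {"NIU", "Missing: Impute"}  # labels recoded to NA in the report
WASH_VARS = ["cat_watersource", "cat_toilettype", "cat_garbagedisposal"]

# Hypothetical toy records; 1 = has access, 0 = no access (assumed coding).
records = [
    {"gender": "Female", "ageyears": 34, "expend_total_USD_per": 40.0,
     "cat_watersource": 1, "cat_toilettype": 0, "cat_garbagedisposal": 1},
    {"gender": "NIU", "ageyears": 51, "expend_total_USD_per": 25.0,
     "cat_watersource": 1, "cat_toilettype": 1, "cat_garbagedisposal": 1},
    {"gender": "Male", "ageyears": 27, "expend_total_USD_per": 10.0,
     "cat_watersource": 0, "cat_toilettype": 0,
     "cat_garbagedisposal": "Missing: Impute"},
    {"gender": "Male", "ageyears": 45, "expend_total_USD_per": 70.0,
     "cat_watersource": 1, "cat_toilettype": 1, "cat_garbagedisposal": 1},
]


def recode_missing(row):
    """Replace NIU / Missing: Impute labels with None (the stand-in for NA)."""
    return {k: (None if v in MISSING_CODES else v) for k, v in row.items()}


def prepare(rows):
    rows = [recode_missing(r) for r in rows]
    # Assumed reading of the rule: drop a case if gender or age is NA.
    rows = [r for r in rows
            if r["gender"] is not None and r["ageyears"] is not None]
    # Center total household expenditure on the mean of the retained cases.
    centre = mean(r["expend_total_USD_per"] for r in rows)
    for r in rows:
        r["expend_total_USD_per_centered"] = r["expend_total_USD_per"] - centre
        values = [r[v] for v in WASH_VARS]
        if None in values:
            # Assumption: no total is computed when any indicator is NA.
            r["wash_total"] = None
            r["wash_prop"] = None
        else:
            r["wash_total"] = sum(values)  # indicators accessed (max = 3)
            r["wash_prop"] = r["wash_total"] / len(WASH_VARS)
    return rows


prepared = prepare(records)
for r in prepared:
    print(r["gender"], r["wash_total"], r["wash_prop"])
```

How NA indicators should enter the total (treated as 0, or left as NA as here) is a design choice that changes the count distribution, so it is worth settling before any count model is fitted.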



1.2 Proposed modelling approaches

  • Use scoring approaches e.g., PCA to create composite WASH variable and then apply GLMM.

  • Apply multivariate mixed models, either using a pseudo-multivariate approach in glmer or using other approaches proposed by Samuel.

  • Assume equal weights for each of the WASH indicator variables and model the total as count data. We could use a Poisson or Negative Binomial model.

  • Model them separately.

  • Any other suggestions?
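
For the equal-weights count option above, one way to write the model down is sketched here. The symbols \(Y_i\) (number of WASH indicators case \(i\) has access to), \(\mathbf{x}_i\) (covariate vector) and \(\mu_i\) are notation introduced for illustration, and the indicators are assumed to be coded 0/1.

\[\begin{align} Y_i &= cat\_watersource_i + cat\_toilettype_i + cat\_garbagedisposal_i, \quad Y_i \in \{0, 1, 2, 3\}\\ Y_i &\sim \text{Poisson}(\mu_i), \quad \log \mu_i = \mathbf{x}_i^\top \boldsymbol{\beta} \end{align}\]

Because the total cannot exceed 3, a Poisson or Negative Binomial model places some probability on impossible values. Treating \(Y_i\) as \(\text{Binomial}(3, p_i)\) with a logit link is one bounded alternative that could be compared against them, although it assumes the three indicators share a common access probability within a case.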

1.3 Codebook



2 Data Exploration

2.1 Missingness

The table below summarizes the proportion of missingness for all the variables.

  • No variable had \(100\%\) missingness, so none were dropped.
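
The missingness summary and the drop rule can be sketched as below, in standard-library Python for illustration only (the report's own analysis uses R). The toy rows and the helper name `missing_proportion` are hypothetical. Unlike the real data, the toy data deliberately contain one fully missing variable so that the drop rule has something to act on.

```python
def missing_proportion(rows):
    """Return {variable: share of rows in which the value is None (NA)}."""
    n = len(rows)
    return {v: sum(r[v] is None for r in rows) / n for v in rows[0]}


# Hypothetical toy rows; None stands in for NA.
rows = [
    {"gender": "Female", "ageyears": 34, "ethnicity": None},
    {"gender": None, "ageyears": 51, "ethnicity": None},
    {"gender": "Male", "ageyears": 27, "ethnicity": None},
    {"gender": "Male", "ageyears": None, "ethnicity": None},
]

props = missing_proportion(rows)

# Variables with 100% missingness are dropped; all others are kept.
fully_missing = [v for v, p in props.items() if p == 1.0]
kept = [{k: v for k, v in r.items() if k not in fully_missing} for r in rows]

print(props)
print("dropped:", fully_missing)
```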



2.2 Descriptives

We begin by showing the distribution of the individual WASH variables (indicators) over time and space (slum area). Thereafter, we show the distribution of the demographic, social and economic variables of interest by the composite WASH variable.

2.2.1 Water source



2.2.2 Toilet type



2.2.3 Garbage disposal



2.2.4 Composite WASH indicator variable



2.2.4.1 Composite WASH indicator and Gender



2.2.4.2 Composite WASH indicator and Age



2.2.4.3 Composite WASH indicator and Ethnicity



2.2.4.4 Composite WASH indicator and Hunger scale



2.2.4.5 Composite WASH indicator and Poverty line



2.2.4.6 Composite WASH indicator and Wealth quintile



2.2.4.7 Composite WASH indicator and Total household expenditure



2.2.5 Proportion of WASH indicators the respondent had access to



3 Data Analysis

3.1 Generalized Linear models

In order to gain some understanding before moving on to a more complex model, we simulated a ‘fake’ response variable.

  • To estimate the \(\beta\)s, a logistic regression model (glm) was fitted to the observed response variable.
  • We then ran \(1000\) simulations and, for each simulation, calculated:

    • \(\mathbf{X\beta}\)
    • \(p = \frac{1}{1 + \exp(-\mathbf{X\beta})}\)
    • ‘Fake’ y

      • y = rbinom(n, 1, p)
    • Average ‘Fake’ y to obtain the proportion of 1s generated
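
The simulation steps above can be sketched as follows, in standard-library Python for illustration only (the report's own analysis uses R). The design matrix `X`, the coefficient vector `beta`, the seed and the helper names are hypothetical. In the analysis the \(\beta\)s come from the fitted logistic regression, which is not reproduced here.

```python
import math
import random


def inv_logit(eta):
    """p = 1 / (1 + exp(-eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_fake_y(X, beta, n_sims=1000, seed=1):
    rng = random.Random(seed)
    # Linear predictor X beta and success probability p, one per row of X.
    eta = [sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row in X]
    p = [inv_logit(e) for e in eta]
    proportions = []
    for _ in range(n_sims):
        # One Bernoulli draw per row, the analogue of rbinom(n, 1, p).
        y = [1 if rng.random() < p_i else 0 for p_i in p]
        # Average the 'fake' y to get the proportion of 1s in this simulation.
        proportions.append(sum(y) / len(y))
    return p, proportions


# Hypothetical inputs: an intercept plus one covariate, with made-up betas.
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
beta = [0.2, 0.5]

p, proportions = simulate_fake_y(X, beta)
mean_p = sum(p) / len(p)
mean_sim = sum(proportions) / len(proportions)
print(round(mean_p, 3), round(mean_sim, 3))
```

Comparing the simulated proportions with the observed proportion of 1s gives a quick check of whether the fitted model reproduces the marginal distribution of the response.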

The first model was fitted separately to each of the WASH variables:

\[\begin{align} single\_wash\_var &\sim intvwyear + slumarea + ageyears\\ & + gender + ethnicity + numpeople\_total + isbelowpovertyline\\ & + wealthquintile + expend\_total\_USD\_per\_centered \end{align}\]

For the second model, the data were restructured into long format and the model was fitted to a single indicator variable:

\[\begin{align} wash\_indicator &\sim (intvwyear + slumarea + ageyears\\ & + gender + ethnicity + numpeople\_total + isbelowpovertyline\\ & + wealthquintile + expend\_total\_USD\_per\_centered) * wash\_variable\_label \end{align}\]

We then used the estimated \(\beta\) (coeffs.) to simulate ‘fake’ y.

  • The simulated values were merged with the WASH variable labels.
  • The resulting distribution was similar to that obtained from fitting each variable individually.
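
The long-format restructuring used for the second model can be sketched as below, in standard-library Python for illustration only (the report's own analysis uses R). The toy rows, the `id` column and the helper name `to_long` are hypothetical. The output column names `wash_indicator` and `wash_variable_label` follow the model formula above.

```python
WASH_VARS = ["cat_watersource", "cat_toilettype", "cat_garbagedisposal"]


def to_long(rows):
    """Stack the three WASH columns into one indicator column plus a label."""
    long_rows = []
    for r in rows:
        # Covariates are repeated once per WASH variable.
        covariates = {k: v for k, v in r.items() if k not in WASH_VARS}
        for var in WASH_VARS:
            long_rows.append({**covariates,
                              "wash_variable_label": var,
                              "wash_indicator": r[var]})
    return long_rows


# Hypothetical wide-format rows: one row per case, 0/1 indicators (assumed).
wide = [
    {"id": 1, "gender": "Female",
     "cat_watersource": 1, "cat_toilettype": 0, "cat_garbagedisposal": 1},
    {"id": 2, "gender": "Male",
     "cat_watersource": 0, "cat_toilettype": 0, "cat_garbagedisposal": 1},
]

long_rows = to_long(wide)
for row in long_rows:
    print(row)
```

Each case contributes three rows in the long layout, so observations from the same case are no longer independent. That dependence is what the mixed-model approaches in the next section are meant to address.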

3.2 Generalized Linear Mixed Models